For help on choosing a statistical test, and on how to implement it in R (or SAS, Stata, or SPSS), have a look at this link. But let’s start with descriptive statistics.
Correlation
A correlation measures the strength and the direction of an association between 2 variables.
The coefficient varies between -1 and +1, where ±1 indicates a perfect association and 0 no association at all. The direction of the association can be positive (+) or negative (-).
The most common measures of correlation are Pearson’s r and Spearman’s rho.
For the Pearson’s r correlation, both variables should be normally distributed.
Spearman’s rho instead is a non-parametric rank correlation (data is ranked), which does NOT require the data to be normally distributed.
r and rho are also effect sizes (they should therefore be written in italics).
According to Cohen, Pearson’s r correlation coefficients can be roughly classified like this (the same categorization applies to Spearman’s rho):
- .10 < r < .30 -> small association
- .30 < r < .50 -> medium association
- r > .50 -> large association
The square of r (r2) tells you how much variance is shared by the 2 correlated variables.
To correlate several variables with each other, use cor() from the stats package:
# here we correlate the last 5 columns of the mtcars dataset with each other
cor(mtcars[,4:8], use="complete.obs", method="spearman")
## hp drat wt qsec vs
## hp 1.0000000 -0.52012499 0.7746767 -0.66660602 -0.7515934
## drat -0.5201250 1.00000000 -0.7503904 0.09186863 0.4474575
## wt 0.7746767 -0.75039041 1.0000000 -0.22540120 -0.5870162
## qsec -0.6666060 0.09186863 -0.2254012 1.00000000 0.7915715
## vs -0.7515934 0.44745745 -0.5870162 0.79157148 1.0000000
# with "complete.obs" missing values are handled by casewise deletion
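To make the shared-variance idea above concrete, here is a quick base-R sketch correlating mpg with hp (Pearson is the default method of cor()):

```r
# Pearson correlation between gas mileage (mpg) and horse power (hp)
r <- cor(mtcars$mpg, mtcars$hp)
r    # about -0.78: a strong negative association
# the squared coefficient gives the proportion of shared variance
r^2  # about 0.60: mpg and hp share roughly 60% of their variance
```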
To see the p value, use cor.test() (but it only works for one pair of variables at a time):
cor.test(mtcars[,4], mtcars[,5], use="complete.obs", method="spearman")
##
## Spearman's rank correlation rho
##
## data: mtcars[, 4] and mtcars[, 5]
## S = 8293.8, p-value = 0.002278
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.520125
Even better is function rcorr() from the Hmisc package. It gives correlation coefficients and significance levels together.
library(Hmisc)
rcorr(as.matrix(mtcars[,4:8]), type="spearman") # Input must be a matrix!
## hp drat wt qsec vs
## hp 1.00 -0.52 0.77 -0.67 -0.75
## drat -0.52 1.00 -0.75 0.09 0.45
## wt 0.77 -0.75 1.00 -0.23 -0.59
## qsec -0.67 0.09 -0.23 1.00 0.79
## vs -0.75 0.45 -0.59 0.79 1.00
##
## n= 32
##
##
## P
## hp drat wt qsec vs
## hp 0.0023 0.0000 0.0000 0.0000
## drat 0.0023 0.0000 0.6170 0.0102
## wt 0.0000 0.0000 0.2148 0.0004
## qsec 0.0000 0.6170 0.2148 0.0000
## vs 0.0000 0.0102 0.0004 0.0000
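If you prefer to avoid the extra package, a minimal base-R sketch can loop cor.test() over all pairs of columns and collect the p values in a matrix, mirroring the P table above:

```r
vars <- mtcars[, 4:8]
n <- ncol(vars)
pmat <- matrix(NA, n, n, dimnames = list(names(vars), names(vars)))
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    # suppressWarnings: with tied ranks, exact Spearman p values cannot be computed
    p <- suppressWarnings(cor.test(vars[[i]], vars[[j]], method = "spearman"))$p.value
    pmat[i, j] <- pmat[j, i] <- p
  }
}
round(pmat, 4)
```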
Corrplots and heatmaps
If you want to quickly convey to the reader the strength of the correlations between several variables, you should use plots.
We discussed the scatterplot before. But a fancier, and sometimes more useful, way of showing correlation strengths is a corrplot.
Use the function corrplot() from the package with the same name. The size and color of the circles indicate the strength and direction of each correlation. Check out the many options.
# install.packages("corrplot")
library(corrplot)
mycorrs <- cor(mtcars[,4:8], use="complete.obs", method="spearman")
corrplot(mycorrs)

Other times, so-called heat maps are useful to quickly spot clusters of associations amongst many variables.
For a famous example, check out the paper “Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey”, published in Neuron by Kriegeskorte and colleagues (2008).

Now let’s do our own (slightly less fancy) heat map with the mtcars dataset
mycorrs <- cor(mtcars[,4:8], use="complete.obs", method="spearman")
heatmap(mycorrs)

Student’s t-test
The t-test allows us to test
- if the average of a variable differs significantly from a specific number (One-sample t-test)
- if two groups of subjects differ from each other (Independent-samples t-test)
- if two measures, which were taken on the same group of subjects, differ from each other (Paired-samples t-test)
It is important to remember, that with a t-test you can only compare a max of 2 groups, or 2 conditions. Use ANOVA or linear mixed models if you have more groups and/or conditions.
By default, the function t.test() assumes unequal variances and applies the Welch degrees of freedom (df) modification. To specify equal variances, use the var.equal = TRUE option.
Also by default, a 2-tailed test is performed. Specify alternative="less" or alternative="greater" to run a 1-tailed test.
One-sample t-test
- For testing if the mean of a sample differs significantly from 0 (or any other number).
For example, we could test if the average horse power of all cars in the mtcars dataset differs from 0 (which it does), and from 150 (which it doesn’t):
t.test(mtcars$hp, mu = 0)
##
## One Sample t-test
##
## data: mtcars$hp
## t = 12.103, df = 31, p-value = 2.794e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 121.9679 171.4071
## sample estimates:
## mean of x
## 146.6875
t.test(mtcars$hp, mu = 150)
##
## One Sample t-test
##
## data: mtcars$hp
## t = -0.2733, df = 31, p-value = 0.7864
## alternative hypothesis: true mean is not equal to 150
## 95 percent confidence interval:
## 121.9679 171.4071
## sample estimates:
## mean of x
## 146.6875
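Under the hood, the one-sample t statistic is just the difference between the sample mean and mu, divided by the standard error of the mean. A quick sketch verifying this against t.test():

```r
x <- mtcars$hp
# t = (mean - mu) / (sd / sqrt(n))
t_manual <- (mean(x) - 150) / (sd(x) / sqrt(length(x)))
t_manual                                  # about -0.27, matching the output above
t_r <- unname(t.test(x, mu = 150)$statistic)
all.equal(t_manual, t_r)                  # TRUE
```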
Independent t-test
- For testing the difference between 2 unrelated samples (i.e. different subjects in two groups or conditions - imagine comparing men with women, or drug 1 with drug 2)
Attention: The specification of the formula differs depending on whether your data is in long or wide format.
In long format we specify a numeric and a binary factor (separated by “~”):
# select only chicks that followed diet 1 or 2 (a t-test can only compare 2 groups):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ]
# average across chicks by time and diet
ChickWeight_temp_long <- aggregate(ChickWeight_temp, by=list(ChickWeight_temp$Time, ChickWeight_temp$Diet), FUN = mean, na.rm = TRUE)
# make a bit nicer
ChickWeight_temp_long <- ChickWeight_temp_long[, c(1:3)]
colnames(ChickWeight_temp_long)[1:2] <- c("Time", "Diet")
# now test if weight differs between diet 1 and 2
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet) )
##
## Welch Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.98631 30.13885
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Assume equal variances (this changes the dfs, and slightly the p value):
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE )
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.4625
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.85473 30.00727
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Change to a 1-tailed t-test (this modifies the p value):
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative="less")
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.2312
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 21.93463
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative="greater")
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.7688
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -55.78209 Inf
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Now let’s put the same data in wide format, to compare 2 numeric vectors (separated by “,”):
library(reshape2)
ChickWeight_temp_wide <- dcast(ChickWeight_temp_long, Time ~ Diet , value.var = "weight")
colnames(ChickWeight_temp_wide)[2:3] <- c("Diet1", "Diet2")
t.test(ChickWeight_temp_wide$Diet1 , ChickWeight_temp_wide$Diet2 )
##
## Welch Two Sample t-test
##
## data: ChickWeight_temp_wide$Diet1 and ChickWeight_temp_wide$Diet2
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.98631 30.13885
## sample estimates:
## mean of x mean of y
## 105.6929 122.6167
It is important to point out that you get the same t and p values and dfs using the data in long or wide format!
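We can verify that equivalence in code. The sketch below rebuilds both formats from scratch (using base R’s aggregate() and reshape() so it is self-contained) and compares the two test objects:

```r
# long format: mean weight per time point, for diets 1 and 2 only
ChickWeight_temp <- ChickWeight[ChickWeight$Diet %in% c(1, 2), ]
long <- aggregate(weight ~ Time + Diet, data = ChickWeight_temp, FUN = mean)
t_long <- t.test(long$weight ~ as.factor(long$Diet))
# wide format: one row per time point, one weight column per diet
wide <- reshape(long, idvar = "Time", timevar = "Diet", direction = "wide")
t_wide <- t.test(wide$weight.1, wide$weight.2)
# identical t statistics and p values
all.equal(unname(t_long$statistic), unname(t_wide$statistic))  # TRUE
all.equal(t_long$p.value, t_wide$p.value)                      # TRUE
```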
Dependent (paired-sample) t-test
- For testing difference between two related samples (i.e. same people tested twice)
We could use the same ChickWeight data, but this time compare the same chicks across 2 time points.
IMPORTANT: add the option paired=TRUE
In long format we specify a numeric and a binary factor (separated by “~”):
# select only times 0 and 2:
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
t.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Time), paired=TRUE )
##
## Paired t-test
##
## data: ChickWeight_temp$weight by as.factor(ChickWeight_temp$Time)
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.190877 -7.129123
## sample estimates:
## mean of the differences
## -8.16
In the wide format, we specify 2 numeric factors (separated by “,”)
library(reshape2)
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
t.test(ChickWeight_temp_wide$time0 , ChickWeight_temp_wide$time2, paired=TRUE )
##
## Paired t-test
##
## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.190877 -7.129123
## sample estimates:
## mean of the differences
## -8.16
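A paired t-test is mathematically the same as a one-sample t-test on the within-subject differences, which makes for a handy sanity check (sketch using base R’s reshape() instead of reshape2):

```r
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ]
# wide format: one row per chick, columns weight.0 and weight.2
wide <- reshape(ChickWeight_temp, idvar = "Chick", timevar = "Time",
                direction = "wide", drop = "Diet")
t_paired <- t.test(wide$weight.0, wide$weight.2, paired = TRUE)
# one-sample t-test on the per-chick differences
t_diff <- t.test(wide$weight.0 - wide$weight.2, mu = 0)
all.equal(unname(t_paired$statistic), unname(t_diff$statistic))  # TRUE
```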
Mann-Whitney U & Wilcoxon signed ranks tests
One of the assumptions of the t-test is that the data are normally distributed. However, even when the data are NOT normally distributed, one can use the t-test as long as the sample size is reasonably large (N > 10 at the very least, better > 100). The t-test is thus quite robust against violations of the normality assumption, as long as the sample is not too small.
Alternatives to the t-test exist that do not require normally distributed data:
The non-parametric Mann-Whitney U test (also called Mann-Whitney-Wilcoxon, Wilcoxon rank-sum, or Wilcoxon-Mann-Whitney test) can be used as an alternative to the independent t-test.
It is calculated in 2 different ways, depending on the sample size (<20 or >20). In either case, it involves ranking the values of both groups, and it does not assume normally distributed data.
The Wilcoxon signed ranks test can be used as an alternative to the dependent t-test.
Implementing these nonparametric tests is easy, and the formula is almost identical to the t-test one.
Here is an example of the Mann-Whitney U test:
# select again only chicks that followed diet 1 or 2 (like the t-test, Mann-Whitney can only compare 2 groups):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ]
# compare the weights between the two diet groups
wilcox.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Diet))
And here an example of the Wilcoxon signed ranks test:
# select only times 0 and 2 (for all chicks):
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2,]
# put Time in wide format
library(reshape2)
ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
# perform statistics:
wilcox.test(ChickWeight_temp_wide$time0 , ChickWeight_temp_wide$time2, paired=TRUE )
##
## Wilcoxon signed rank test with continuity correction
##
## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## V = 8, p-value = 1.132e-09
## alternative hypothesis: true location shift is not equal to 0
Linear models (regression)
Simple regression
A simple linear regression is similar to a correlation: You want to know the link between two variables. In a regression, however, you go further than in a correlation, as you want to know how much y depends on (is predicted by) x.
A regression therefore tries to predict changes in the dependent variable (DV, the thing you measured) based on 1 or more independent variables (IVs, predictors), even for a range of the IV that is yet unknown.
For example, the “mtcars” dataset includes cars that have between 52 and 335 horse power. By fitting a regression model with miles per gallon (mpg) as DV, and horse power (hp) as IV, we can try to predict what the mpg would be for a car with 400 hp.
The general formula for a linear regression is lm(DV ~ IV, yourdata), where DV is the dependent variable (what you measured), and IV the independent one (the factor partly explaining or predicting the DV). Get the result using summary(), or anova().
Let’s test if the horsepower of cars predicts their gas consumption (in mpg):
model1 <- lm(mpg ~ hp, mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
How to read the output:
R-squared (R2) informs us about the amount of variance of mpg that is explained by the model (in this case 60%). Since a simple regression only includes 1 IV, R2 also represents how much variance is explained by hp. If you take the square root of R2, you get the Pearson correlation coefficient r: sqrt(0.6024) = 0.78. Thus, there is a strong correlation between mpg and hp.
The Adjusted R2 indicates how well the model generalizes to the population. This value should be close to R2, but is typically a bit smaller. The adjusted R-squared also adjusts for the fact that models with more IVs (see multiple regression) will automatically have a greater R2.
The significant F-Statistic in the last line informs us that the model predicts well the DV. Since the model only includes 1 IV, the F-Statistic in this case also tells us that hp is a good predictor of mpg.
The most important part of the output is the Coefficients table:
- The estimate in the first line (Intercept) gives the intercept of the model’s line, i.e. how many miles per gallon you could drive if your car had 0 horse power: 30! (Our model obviously doesn’t understand much about cars, otherwise it would know that a car with 0 hp does not drive at all.)
- The line starting with “hp” gives the estimate of how much mpg changes for a 1-point change in hp (it is the slope, or gradient, of the line best representing the data). The t value indicates the magnitude of the IV’s effect, and the p value its statistical significance. Notice that the estimate for hp is negative, because there is indeed a negative relation between mpg and hp: the stronger the car, the fewer miles it drives per gallon of gas. The estimate also tells us that for every additional horse power you lose 0.068 mpg. Thus, a car with 400 hp would get about 2.8 mpg (the result of 30.09886 - (0.06823 * 400)).
Let’s verify these things by plotting the scatterplot with the fitted regression line. To have the line extend all the way from 0 to 400 hp, we need to include xlim(0,400) and the fullrange=TRUE option.
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
xlim(0, 400) +
stat_smooth(method = "lm", fullrange=TRUE)

In the summary of the regression model, the 3rd line from the bottom provides the degrees of freedom (df = 30).
The statistics can be reported like this: hp significantly predicted mpg, as confirmed by linear regression (t(30) = -6.742, p < .001, R-squared = .6).
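The 400-hp extrapolation can also be obtained with predict(), which applies the fitted coefficients for you:

```r
model1 <- lm(mpg ~ hp, mtcars)
# predicted mpg for a hypothetical car with 400 hp
pred <- predict(model1, newdata = data.frame(hp = 400))
pred  # about 2.8 mpg
# identical to applying the coefficients by hand
coef(model1)[1] + coef(model1)[2] * 400
```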
Multiple regression
Same as simple regression, only that you want to predict changes in the DV based on several IVs.
A more complicated model that includes several predictors could help explain more variance, and thus achieve a greater R-squared.
You can compare the fit (R2) of different regression models using the function anova().
For example, do we explain more variance by also including the car’s cylinders?
model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 447.67
## 2 29 291.97 1 155.7 15.465 0.0004804 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes we do. The comparison of the 2 models shows that model2 fits the data better. The adjusted R2 of the 2nd model (the more complex one, which includes 2 predictors) is 0.72, and thus clearly bigger than that of the 1st model (0.59).
c("adj R2 mod1: ", round( summary(model1)$adj.r.squared, digits=2) )
## [1] "adj R2 mod1: " "0.59"
c("adj R2 mod2: ", round( summary(model2)$adj.r.squared, digits=2) )
## [1] "adj R2 mod2: " "0.72"
By using multiple instead of simple regression, we were able to explain 13% more variance!
Which of the 2 IVs is more important in mod2? To answer that question, we can look at their estimates, SEs, and t-values:
summary(model2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.9083305 2.19079864 16.846975 1.620660e-16
## hp -0.0191217 0.01500073 -1.274718 2.125285e-01
## cyl -2.2646936 0.57588924 -3.932516 4.803752e-04
Only cyl significantly predicts mpg (as indicated by its p value being < .05). Therefore, the number of cylinders of a car has a bigger effect on gas consumption than horse power.
Another way to look at this is by using the standardized estimates. For that, we use the function lm.beta() from the package QuantPsyc:
library(QuantPsyc)
lm.beta(model2)
## hp cyl
## -0.2175294 -0.6710802
The standardized beta estimates indicate that gas consumption increases (mpg decreases) by 0.67 standard deviations (SDs) when the number of cylinders increases by 1 SD. The same change in hp only leads to a 0.22 SD change in mpg.
Similarly, the 95% confidence interval (CI) of the unstandardized beta estimate for cyl does not cross 0, whereas the CI for hp does. This indicates that, of the 2 predictors, the number of cylinders is the more reliable one (its slope is always negative) and plays the bigger role.
confint(model2)
## 2.5 % 97.5 %
## (Intercept) 32.42764417 41.38901679
## hp -0.04980163 0.01155824
## cyl -3.44251935 -1.08686785
We can also represent the size of each estimate, and the CI around it, using the function plot_model() of the package sjPlot. I have sorted the effects so that the biggest is at the top.
library(sjPlot)
plot_model(model2)

One could also add even more IVs to the model. However, adding displacement (disp) does not significantly improve the model fit:
model2 <- lm(mpg ~ hp + cyl, mtcars)
model3 <- lm(mpg ~ hp + cyl + disp, mtcars)
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 291.98
## 2 28 261.37 1 30.605 3.2787 0.08093 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding weight (wt) does improve the model significantly:
model2 <- lm(mpg ~ hp + cyl, mtcars)
model4 <- lm(mpg ~ hp + cyl + wt, mtcars)
anova(model2, model4)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 291.98
## 2 28 176.62 1 115.35 18.287 0.0001995 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, the adjusted R2 is now 0.83.
But notice how adding weight to the regression turns hp and cyl into non-significant predictors (their p values are > .05). This happens because the IVs included in the model are not independent of each other. The estimate (slope) of each IV is computed while the other IVs are held constant. Thus, when weight and horse power are held constant, the number of cylinders no longer plays a role!
summary(model4)
##
## Call:
## lm(formula = mpg ~ hp + cyl + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## wt -3.16697 0.74058 -4.276 0.000199 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
library(sjPlot)
plot_model(model4)

Main effects and interactions in multiple regressions
Multiple regressions include more than one predictor (or IV). There are several ways in which you can enter these predictors in the model.
First, you may want to only look at the main (or simple) effects of each predictor separately (just separate them by + signs).
Second, you may want to also investigate if the predictors interact with each other.
If you have more than 2 predictors, you can either look at all the main and interaction effects, or specify only certain interactions. The "*" sign specifies main and interaction effects of the 2 IVs on each side of the star. The “:” sign specifies only the interaction of the 2 IVs on each side of the colon.
The table shows how to write the corresponding formulas.

| Model | Effects | Formula |
|-------|---------|---------|
| a | 3 main effects | IV1 + IV2 + IV3 |
| b | 3 main effects and 1 2-way interaction | IV1 * IV2 + IV3 |
| c | alternative writing of model “b” | IV1 + IV2 + IV1:IV2 + IV3 |
| d | 3 main effects and 2 2-way interactions | IV1 + IV2 + IV3 + IV1:IV2 + IV1:IV3 |
| e | 3 main effects, 3 2-way interactions and 1 3-way interaction | IV1 * IV2 * IV3 |
| f | alternative writing of model “e” | IV1 + IV2 + IV3 + IV1:IV2 + IV1:IV3 + IV2:IV3 + IV1:IV2:IV3 |
And these are the models using mpg as DV and hp, cyl, and disp as IVs.
model_a <- lm(mpg ~ hp + cyl + disp, mtcars)
model_b <- lm(mpg ~ hp * cyl + disp, mtcars)
model_c <- lm(mpg ~ hp + cyl + hp : cyl + disp, mtcars)
model_d <- lm(mpg ~ hp + cyl + disp + hp : cyl + hp : disp, mtcars)
model_e <- lm(mpg ~ hp * cyl * disp, mtcars)
model_f <- lm(mpg ~ hp + cyl + disp + hp:cyl + hp:disp + cyl:disp + hp:cyl:disp, mtcars)
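Models “b” and “c” (and likewise “e” and “f”) are just different spellings of the same model, which we can confirm by comparing their fitted values:

```r
model_b <- lm(mpg ~ hp * cyl + disp, mtcars)
model_c <- lm(mpg ~ hp + cyl + hp:cyl + disp, mtcars)
all.equal(fitted(model_b), fitted(model_c))  # TRUE: identical fits
model_e <- lm(mpg ~ hp * cyl * disp, mtcars)
model_f <- lm(mpg ~ hp + cyl + disp + hp:cyl + hp:disp + cyl:disp + hp:cyl:disp, mtcars)
all.equal(fitted(model_e), fitted(model_f))  # TRUE
```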
Plot regression effects
Several packages exist that allow you to quickly visualize the results of your linear models.
A good and versatile package is sjPlot. Check out this helpful vignette for more details.
Let’s visualize the slope of the effects of hp on mpg
library(sjPlot)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot_model(model2, type = "pred", terms = "hp")

Now we plot the same, but separately for the different numbers of cylinders.
library(sjPlot)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot_model(model2, type = "pred", terms = c("hp", "cyl") )

And here with 3 IVs:
library(sjPlot)
model3 <- lm(mpg ~ hp + cyl + disp, mtcars)
plot_model(model3, type = "pred", terms = c("hp", "disp", "cyl") )

Another commonly used one is the effects package:
library(effects)
model2 <- lm(mpg ~ hp + cyl, mtcars)
plot(predictorEffect("hp", model2))

Also check out the function effect_plot() from the package jtools.
Regression with continuous vs. categorical predictors
In the previous examples of simple and multiple regression, we have used continuous predictors, i.e. columns in the data frame (variables) that contain numbers.
class(mtcars$mpg)
## [1] "numeric"
class(mtcars$cyl)
## [1] "numeric"
class(mtcars$wt)
## [1] "numeric"
Classical continuous variables are things that can take a quasi-infinite number of values (e.g. weight, temperature, or miles per gallon).
We can also use categorical predictors in a regression. These are variables that consist of 2 or more categories (e.g. male and female, or tall and short, etc.).
The number of categories in a categorical variable is small (say, between 2 and 5). But the separation between categorical and continuous variables is not always clear-cut. In fact, as long as the number of distinct values in a variable is relatively small, a categorical variable can be used as a continuous predictor, and the other way around.
The variable “cyl” only contains 3 categories, as cars in the mtcars dataset only have either 4, 6, or 8 cylinders. If we convert it to a factor, it will be treated in the model as a categorical predictor.
unique(mtcars$cyl)
## [1] 6 4 8
mtcars$cyl <- as.factor(mtcars$cyl)
Let’s start with a simple regression in which “mpg” gets predicted by “cyl”
model5 <- lm(mpg ~ cyl, mtcars)
summary(model5)
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
## cyl6 -6.9208 1.5583 -4.441 0.000119 ***
## cyl8 -11.5636 1.2986 -8.905 8.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
As you can see from the summary, there are now 2 lines for cyl, and they have the names “cyl6” and “cyl8”. This is because lm() has automatically taken the cars with 4 cylinders as the reference level, and compared the cars with 6 and 8 cylinders to the cars with 4 cylinders.
The intercept now describes the average mpg for the cars with 4 cylinders:
mean(mtcars[mtcars$cyl==4,]$mpg)
## [1] 26.66364
The estimate for “cyl6” is -6.9208, which means that cars with 6 cylinders on average can drive 6.9 mpg less than the cars with 4 cylinders. Indeed, the average mpg for cars with 6 cylinders is 19.7, which also corresponds to 26.6 - 6.9.
mean(mtcars[mtcars$cyl==6,]$mpg)
## [1] 19.74286
26.66364 - 6.9208
## [1] 19.74284
The estimate for “cyl8” is -11.5636, which means that cars with 8 cylinders on average can drive 11.6 mpg less than the cars with 4 cylinders. Indeed, the average mpg for cars with 8 cylinders is 15.1, which also corresponds to 26.6 - 11.6
mean(mtcars[mtcars$cyl==8,]$mpg)
## [1] 15.1
26.66364 - 11.5636
## [1] 15.10004
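This pattern is a general property of dummy (treatment) coding: the intercept is the mean of the reference group, and each coefficient is a group’s difference from it. A sketch checking all three group means at once (working on a copy, so mtcars itself stays untouched):

```r
mtcars2 <- mtcars
mtcars2$cyl <- as.factor(mtcars2$cyl)
model5 <- lm(mpg ~ cyl, mtcars2)
b <- coef(model5)
group_means <- tapply(mtcars2$mpg, mtcars2$cyl, mean)
# intercept = mean of the reference group (4 cylinders)
all.equal(unname(b[1]), unname(group_means["4"]))         # TRUE
# intercept + dummy coefficient = mean of the respective group
all.equal(unname(b[1] + b[2]), unname(group_means["6"]))  # TRUE
all.equal(unname(b[1] + b[3]), unname(group_means["8"]))  # TRUE
```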
Linear models like to think in lines. Therefore, the sign and the size of each estimate also tell you the slope of the line connecting the average mpg of cars with 4 cylinders to the average mpg of cars with either 6 or 8 cylinders.
In the 2 plots below, you can see that the blue slope going from 4 to 6 cylinders (on the left) is less steep than the red line going to 8 cylinders (on the right) - never mind the fact that the x-axis also shows 6 on the right (I recoded the cyl numbers only for display here).
mtcars$cyl_num <- as.numeric(levels(mtcars$cyl))[mtcars$cyl] # cyl as numeric
datatemp1 <- mtcars[mtcars$cyl != 8,] # exclude cars with 8 cylinders
datatemp2 <- mtcars[mtcars$cyl != 6,] # exclude cars with 6 cylinders
library(ggplot2)
gg_6 <- ggplot(datatemp1, aes(x = cyl_num, y = mpg)) +
geom_point(alpha = 0.3, size = 3) +
geom_smooth(data = datatemp1, method='lm',se=FALSE, colour = "blue") +
scale_x_continuous(name = "cylinders", limits=c(3.5, 6.5), breaks=c(4, 6)) +
scale_y_continuous(limits=c(10, 35)) +
ggtitle("4 to 6 cyl") +
theme_bw()
datatemp2$cyl_num <- ifelse(datatemp2$cyl_num == 8, 6, 4) # recode 8 to 6, so both plots span the same x-range
gg_8 <- ggplot(datatemp2, aes(x = cyl_num, y = mpg)) +
geom_point(alpha = 0.3, size = 3) +
geom_smooth(data = datatemp2, method='lm',se=FALSE, colour = "red") +
scale_x_continuous(name = "cylinders", limits=c(3.5, 6.5), breaks=c(4, 6)) +
scale_y_continuous(limits=c(10, 35)) +
ggtitle("4 to 8 cyl") +
theme_bw()
library(cowplot)
cowplot::plot_grid(gg_6, gg_8)

Things can become more complicated if you add more IVs to the model. For example, we could add horse power as a numeric predictor.
mtcars$cyl <- as.factor(mtcars$cyl)
model6 <- lm(mpg ~ cyl * hp , mtcars)
summary(model6)
##
## Call:
## lm(formula = mpg ~ cyl * hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7600 -1.8152 -0.1971 1.5012 7.1606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.98303 3.88908 9.252 1.04e-09 ***
## cyl6 -15.30917 7.43456 -2.059 0.04962 *
## cyl8 -17.90295 5.25961 -3.404 0.00216 **
## hp -0.11278 0.04575 -2.465 0.02061 *
## cyl6:hp 0.10516 0.06848 1.536 0.13672
## cyl8:hp 0.09853 0.04862 2.026 0.05310 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.028 on 26 degrees of freedom
## Multiple R-squared: 0.7882, Adjusted R-squared: 0.7475
## F-statistic: 19.35 on 5 and 26 DF, p-value: 5.019e-08
As in model 5, we first get the coefficients “cyl6” and “cyl8”, which reflect the comparison of cars with 6 and 8 cylinders to those with 4 cylinders. The coefficients have changed, because in the presence of the interaction they now represent the group differences at hp = 0 (hp was not centered).
Then follows one line showing the coefficient for the predictor “hp”. It is only one line, because this variable was entered as numeric.
Finally, we have the coefficients for the interaction of cyl and hp. More specifically, “cyl6:hp” indicates how the relationship (slope) between mpg and hp changes from cars with 4 cylinders to those with 6 cylinders. Similarly, “cyl8:hp” indicates how the slope for hp changes from cars with 4 cylinders to those with 8 cylinders.
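Because the model contains the full cyl × hp interaction, the hp slope implied for each cylinder group equals the slope from a separate regression fitted to that group alone. A sketch verifying this (again on a copy of mtcars):

```r
mtcars2 <- mtcars
mtcars2$cyl <- as.factor(mtcars2$cyl)
model6 <- lm(mpg ~ cyl * hp, mtcars2)
b <- coef(model6)
slope4 <- unname(b["hp"])                 # hp slope in the reference group (4 cyl)
slope6 <- unname(b["hp"] + b["cyl6:hp"])  # hp slope for 6-cylinder cars
# separate per-group regressions recover the same slopes
fit4 <- lm(mpg ~ hp, mtcars2[mtcars2$cyl == "4", ])
fit6 <- lm(mpg ~ hp, mtcars2[mtcars2$cyl == "6", ])
all.equal(slope4, unname(coef(fit4)["hp"]))  # TRUE
all.equal(slope6, unname(coef(fit6)["hp"]))  # TRUE
```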
Assumptions for linear regression
There are a number of assumptions that the linear regression model makes about the data.
Linearity
Plot the model’s residuals (the distances from each data point to the fitted regression line), which ideally should be evenly distributed around the horizontal zero line. If they aren’t, the assumption of linearity might be violated. In that case, try transforming the data, e.g. with a log-transform. Or maybe you missed some important factor (IV) in the model.
To inspect the residuals, you can plot them against the fitted values:
# rm(mtcars) # delete the mtcars in the environment
model6 <- lm(mpg ~ wt, mtcars)
plot(fitted(model6),residuals(model6))

Or plot the residuals against the actual data:
model6 <- lm(mpg ~ wt, mtcars)
model6_res = resid(model6)
plot(mtcars$wt, model6_res, ylab="Residuals", xlab="Car weight")
abline(0, 0)

Absence of collinearity
Collinearity basically means that 2 or more of the IVs are highly correlated with each other. This can be problematic for the model, and its interpretation. It also means that will deal with larger standard errors and confidence intervals, resulting in fewer chances to reject H0.
One (approximate) way to check for possible collinearity is to see whether some of the IVs in the model are highly correlated. For that, create a pair-wise correlation plot.
rm(mtcars) # remove any modified copy, so the built-in mtcars dataset is used
round(cor(mtcars[, c(2:4, 6)]), 2) # correlate 4 IVs of interest
## cyl disp hp wt
## cyl 1.00 0.90 0.83 0.78
## disp 0.90 1.00 0.79 0.89
## hp 0.83 0.79 1.00 0.66
## wt 0.78 0.89 0.66 1.00
Or use corrplot, which produces an even nicer visualization:
library(corrplot)
cor1 = cor(mtcars[, c(2:4, 6)])
corrplot.mixed(cor1, lower.col = "black")

Another sign of possible collinearity is a very high R2, when at the same time most of the coefficients are not significant.
To quantify collinearity more formally, you can use the variance inflation factor (VIF). For example, the function vif() from the package car computes the VIF of each IV by regressing it onto the other IVs of the model (the actual formula is 1/(1−R2) of that IV).
A VIF over 10 typically indicates high multicollinearity, while a VIF below 4 or 5 is usually considered acceptable.
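To illustrate the formula, here is a sketch that computes one VIF by hand, using the same four IVs from the mtcars dataset as in this section:

```r
# sketch: compute the VIF of wt by hand, via the formula 1/(1 - R^2),
# where R^2 comes from regressing wt onto the other IVs of the model
r2_wt <- summary(lm(wt ~ cyl + disp + hp, mtcars))$r.squared
1 / (1 - r2_wt)  # should match the value car::vif() reports for wt
```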
Model7 includes the correlated IVs cyl and disp; let’s check the VIF of each IV.
model7 <- lm(mpg ~ cyl + disp + hp + wt, mtcars)
car::vif(model7)
## cyl disp hp wt
## 6.737707 10.373286 3.405983 4.848016
It seems like the IV “disp” is particularly problematic. The VIFs are all much lower if we remove it from the model.
model8 <- lm(mpg ~ cyl + hp + wt, mtcars)
car::vif(model8)
## cyl hp wt
## 4.757456 3.258481 2.580486
The Farrar-Glauber test and other model-wide diagnostic measures of collinearity can be computed with the omcdiag() function from the mctest package:
library(mctest)
omcdiag(mtcars[, c(2:4, 6)], mtcars$mpg)
##
## Call:
## omcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.0115 0
## Farrar Chi-Square: 123.9875 1
## Red Indicator: 0.8131 1
## Sum of Lambda Inverse: 25.3650 1
## Theil's Method: 0.7094 1
## Condition Number: 26.8753 0
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
Out of 6 tests, collinearity was detected by 4. This suggests that multicollinearity is likely.
The function imcdiag includes the VIF and 6 other multicollinearity diagnostics, for each IV individually:
library(mctest)
imcdiag(mtcars[, c(2:4, 6)], mtcars$mpg)
##
## Call:
## imcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein
## cyl 6.7377 0.1484 53.5519 83.1967 0.3853 -0.5667 1
## disp 10.3733 0.0964 87.4840 135.9126 0.3105 -0.8724 1
## hp 3.4060 0.2936 22.4558 34.8867 0.5418 -0.2864 0
## wt 4.8480 0.2063 35.9148 55.7962 0.4542 -0.4077 0
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## cyl , disp , hp , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.8486
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
If you only want the VIF, you can specify so via the method argument:
imcdiag(mtcars[, c(2:4, 6)], mtcars$mpg, method="VIF")
##
## Call:
## imcdiag(x = mtcars[, c(2:4, 6)], y = mtcars$mpg, method = "VIF")
##
##
## VIF Multicollinearity Diagnostics
##
## VIF detection
## cyl 6.7377 0
## disp 10.3733 1
## hp 3.4060 0
## wt 4.8480 0
##
## Multicollinearity may be due to disp regressors
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## ===================================
ANOVA (one-way, two-way, within, between, mixed, rep measures)
The Analysis of Variance (ANOVA) is a special case of a linear regression, or linear model, in which the predictor variable(s) are categorical (rather than continuous).
For example, you could have the between-subjects factor “treatment” (meaning each subject gets only one level) with 3 levels: drug A, drug B, and placebo. And/or the within-subjects factor “condition” (meaning every subject goes through all levels), with 3 levels: attention to faces, attention to houses, and a control condition.
One-way ANOVA
That linear regression and ANOVA are closely related becomes evident from the fact that you can also look at the results of a linear regression using the anova() function.
Let’s again regress mpg on hp, using the mtcars data.
summary(lm(mpg ~ hp, mtcars))
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Notice that the last line gives the F-statistic. Indeed, you can also look at the results of a linear regression using the anova() function:
anova(lm(mpg ~ hp, mtcars))
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.37 678.37 45.46 1.788e-07 ***
## Residuals 30 447.67 14.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to run an ANOVA, and get a results table similar to the one in SPSS, is to use the aov() function.
The first argument is the dependent variable (DV), followed by the tilde symbol (~) and the independent variable(s) (IVs). Finally, you indicate the dataset to be used. You store the analysis in an object and then look at the results using summary().
It is also possible to get the means and the number of subjects per cell using print(model.tables()) (this only makes sense with a categorical predictor).
Let’s test again if mpg differs by hp
model2 <- aov(mpg ~ hp, mtcars)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.4 678.4 45.46 1.79e-07 ***
## Residuals 30 447.7 14.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And to use a categorical predictor, let’s see if mpg differs by number of cylinders (which it does)
# first convert cyl from numeric to factor
mtcars_temp <- mtcars
mtcars_temp$cyl <- as.factor(mtcars_temp$cyl)
# then run the anova
model3 <- aov(mpg ~ cyl, mtcars_temp)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 824.8 412.4 39.7 4.98e-09 ***
## Residuals 29 301.3 10.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A quick way to get the means and number of cars per cylinder-group is to use model.tables():
print(model.tables(model3,"means"),digits=3)
## Tables of means
## Grand mean
##
## 20.09062
##
## cyl
## 4 6 8
## 26.7 19.7 15.1
## rep 11.0 7.0 14.0
Do we get the same means with ddply? Yes, we do!
library(plyr)
ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), mpg_mean = round(mean(mpg), digits=1))
## cyl N_cars mpg_mean
## 1 4 11 26.7
## 2 6 7 19.7
## 3 8 14 15.1
To display boxplots of mpg by cylinder-group:
boxplot(mpg ~ cyl, data=mtcars_temp)

Repeated measures ANOVA
See this introduction to repeated measures Analysis of Variance (rmANOVA) by DataCamp.
There are different ways of computing a rmANOVA. For example, in his book Discovering Statistics Using R, Andy Field lists 4 ways.
Of these, the ezANOVA() function from the ez package is (as of November 2018) not available for the latest R version (3.5.1).
I thus recommend using aov(), or a linear mixed model (LMM) with the function lmer() from the package lme4.
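A minimal sketch of both approaches, using a hypothetical long-format data frame (the names dat, subject, condition, and score are illustrative, not from the source):

```r
# hypothetical long-format data: one row per subject x condition
set.seed(1)
dat <- data.frame(
  subject   = factor(rep(1:10, each = 3)),
  condition = factor(rep(c("faces", "houses", "control"), times = 10)),
  score     = rnorm(30)
)

# rmANOVA with aov(): the Error() term treats subject as a random factor,
# so the condition effect is tested against the within-subject error
summary(aov(score ~ condition + Error(subject/condition), data = dat))

# roughly equivalent LMM with lme4: a random intercept per subject
library(lme4)
summary(lmer(score ~ condition + (1 | subject), data = dat))
```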
…